Use the select function to select variables (columns) from a tibble.

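Given a tibble, select can be used to pick a subset of variables, put them in a chosen order, drop (deselect) variables, rename variables while selecting them, and move variables to the front of the tibble.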

Let’s take the pulse dataset:

pulse 
# A tibble: 110 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1993_A Bonn…    173     57    18 female no     yes     modera… sat       86     88
 2 1993_B Mela…    179     58    19 female no     yes     modera… ran       82    150
 3 1993_C Cons…    167     62    18 female no     yes     high    ran       96    176
 4 1993_D Trav…    195     84    18 male   no     yes     high    sat       71     73
 5 1993_E Lauri    173     64    18 female no     yes     low     sat       90     88
 6 1993_F Geor…    184     74    22 male   no     yes     low     ran       78    141
 7 1993_G Cher…    162     57    20 female no     yes     modera… sat       68     72
 8 1993_H Fran…    169     55    18 female no     yes     modera… sat       71     77
 9 1993_I Sonja    164     56    19 female no     yes     high    sat       68     68
10 1993_J Troy     168     60    23 male   no     yes     modera… ran       88    150
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

select takes a tibble as its first argument, followed by a comma-separated list of the variables you want, and returns a tibble containing only those variables:

select(pulse, name, age)
# A tibble: 110 × 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows

After this selection, does the pulse tibble still contain the variables ‘name’ and ‘age’?

Yes, ‘select’ returns the selection as a tibble and does not modify the underlying tibble. You can check this by entering ‘pulse’ in the R console.
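The same check can be sketched on a small stand-in tibble. The name pulse_mini is hypothetical and is used here only because the full pulse dataset is not reproduced in this text; its three rows are copied from the printout above.

```r
library(dplyr)

# Hypothetical three-row stand-in for the pulse tibble
pulse_mini <- tibble(
  name   = c("Bonnie", "Melanie", "Consuelo"),
  age    = c(18, 19, 18),
  gender = c("female", "female", "female")
)

# Column names of the tibble returned by select()
names(select(pulse_mini, name, age))

# Column names of the original tibble, which select() left untouched
names(pulse_mini)
```

The first call lists only name and age, while the second still lists all three variables.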


If you want to keep your selection as a separate tibble, you need to assign the result to a new variable, e.g. pulse_name_age:

pulse_name_age <- select(pulse, name, age)
pulse_name_age
# A tibble: 110 × 2
   name        age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows

Variable order

The order of the selected variables is reflected in the resulting tibble:

select(pulse, age, name)
# A tibble: 110 × 2
     age name     
   <dbl> <chr>    
 1    18 Bonnie   
 2    19 Melanie  
 3    18 Consuelo 
 4    18 Travis   
 5    18 Lauri    
 6    22 George   
 7    20 Cherry   
 8    18 Francesca
 9    19 Sonja    
10    23 Troy     
# … with 100 more rows

Deselect variables

You can also deselect variables, in other words keep the complement of the variables you name. This is done by putting a - sign in front of each variable to drop:

select(pulse, -smokes, -alcohol)
# A tibble: 110 × 11
   id     name      height weight   age gender exercise ran   pulse1 pulse2  year
   <chr>  <chr>      <dbl>  <dbl> <dbl> <chr>  <chr>    <chr>  <dbl>  <dbl> <dbl>
 1 1993_A Bonnie       173     57    18 female moderate sat       86     88  1993
 2 1993_B Melanie      179     58    19 female moderate ran       82    150  1993
 3 1993_C Consuelo     167     62    18 female high     ran       96    176  1993
 4 1993_D Travis       195     84    18 male   high     sat       71     73  1993
 5 1993_E Lauri        173     64    18 female low      sat       90     88  1993
 6 1993_F George       184     74    22 male   low      ran       78    141  1993
 7 1993_G Cherry       162     57    20 female moderate sat       68     72  1993
 8 1993_H Francesca    169     55    18 female moderate sat       71     77  1993
 9 1993_I Sonja        164     56    19 female high     sat       68     68  1993
10 1993_J Troy         168     60    23 male   moderate ran       88    150  1993
# … with 100 more rows
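As a minimal sketch of the same idea, the minus sign can also be applied once to a group of variables wrapped in c(). The tibble pulse_mini below is a hypothetical three-row stand-in for pulse, with values copied from the printout above.

```r
library(dplyr)

# Hypothetical three-row stand-in for the pulse tibble
pulse_mini <- tibble(
  name    = c("Bonnie", "Melanie", "Consuelo"),
  age     = c(18, 19, 18),
  smokes  = c("no", "no", "no"),
  alcohol = c("yes", "yes", "yes")
)

# One minus sign per variable, as in the example above
dropped_one_by_one <- select(pulse_mini, -smokes, -alcohol)

# A single minus sign applied to a group of variables
dropped_as_group <- select(pulse_mini, -c(smokes, alcohol))

# Both forms keep the same variables: name and age
names(dropped_one_by_one)
names(dropped_as_group)
```

The grouped form can be easier to read when many variables have to be dropped at once.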

Select and rename

select can also rename variables while selecting them, using the form new_name = old_name:

select(pulse, FirstName = name, Age = age)
# A tibble: 110 × 2
   FirstName   Age
   <chr>     <dbl>
 1 Bonnie       18
 2 Melanie      19
 3 Consuelo     18
 4 Travis       18
 5 Lauri        18
 6 George       22
 7 Cherry       20
 8 Francesca    18
 9 Sonja        19
10 Troy         23
# … with 100 more rows

What is the variable name in the pulse dataset, ‘Age’ or ‘age’?

‘age’. We only ran select and did not store its result back into the pulse tibble with an assignment (‘<-’), so pulse itself is unchanged.
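The sketch below contrasts renaming inside select with dplyr's rename function, which takes the same new_name = old_name form but keeps all variables. It runs on a hypothetical stand-in tibble, pulse_mini, whose three rows are copied from the pulse printout. The object names renamed_subset and renamed_all are likewise only illustrative.

```r
library(dplyr)

# Hypothetical three-row stand-in for the pulse tibble
pulse_mini <- tibble(
  name   = c("Bonnie", "Melanie", "Consuelo"),
  age    = c(18, 19, 18),
  gender = c("female", "female", "female")
)

# select() with new_name = old_name keeps only the listed variables
renamed_subset <- select(pulse_mini, FirstName = name, Age = age)
names(renamed_subset)

# rename() keeps every variable and changes only the listed names
renamed_all <- rename(pulse_mini, FirstName = name, Age = age)
names(renamed_all)

# The original tibble still has its old names, because no result
# was assigned back to pulse_mini
names(pulse_mini)
```

To make a rename permanent, the result would have to be assigned back, for example pulse_mini <- rename(pulse_mini, Age = age).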


Reshuffle variables

With select we can also change the positions of the variables in the tibble. When a dataset contains a large number of variables, you may want to bring the more ‘important’ variables to the front for inspection. You can do this with select in combination with the helper function everything():

select(pulse, name, age, everything()) 
# A tibble: 110 × 13
   name     age id    height weight gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <dbl> <chr>  <dbl>  <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 Bonnie    18 1993…    173     57 female no     yes     modera… sat       86     88
 2 Melan…    19 1993…    179     58 female no     yes     modera… ran       82    150
 3 Consu…    18 1993…    167     62 female no     yes     high    ran       96    176
 4 Travis    18 1993…    195     84 male   no     yes     high    sat       71     73
 5 Lauri     18 1993…    173     64 female no     yes     low     sat       90     88
 6 George    22 1993…    184     74 male   no     yes     low     ran       78    141
 7 Cherry    20 1993…    162     57 female no     yes     modera… sat       68     72
 8 Franc…    18 1993…    169     55 female no     yes     modera… sat       71     77
 9 Sonja     19 1993…    164     56 female no     yes     high    sat       68     68
10 Troy      23 1993…    168     60 male   no     yes     modera… ran       88    150
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

The everything function matches all variables in the tibble. Because name and age have already been selected, they keep their first position and select places the remaining variables after them.
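As a small sketch of this reordering, the example below uses a hypothetical stand-in tibble, pulse_mini, with three rows copied from the pulse printout. It also shows relocate, which by default moves the named variables to the front; this assumes dplyr 1.0.0 or later, where relocate is available.

```r
library(dplyr)

# Hypothetical three-row stand-in for the pulse tibble
pulse_mini <- tibble(
  id     = c("1993_A", "1993_B", "1993_C"),
  name   = c("Bonnie", "Melanie", "Consuelo"),
  height = c(173, 179, 167),
  age    = c(18, 19, 18)
)

# name and age first, then all remaining variables in their original order
front_select <- select(pulse_mini, name, age, everything())
names(front_select)

# relocate() moves the named variables to the front by default
# (assumes dplyr >= 1.0.0)
front_relocate <- relocate(pulse_mini, name, age)
names(front_relocate)
```

Both calls give the order name, age, id, height. No variable is dropped in either case.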



Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC